Shape-based peak identification for ChlP-Seq 

Valerie Hower, Steven N. Evans, and Lior Pachter 
University of California, Berkeley 



^**^ Abstract. We present a new algorithm for the identification of bound 

fN^ regions from ChlP-seq experiments. Our method for identifying statis- 

tically significant peaks from read coverage is inspired by the notion of 
persistence in topological data analysis and provides a non-parametric 
approach that is robust to noise in experiments. Specifically, our method 
reduces the peak calling problem to the study of tree-based statistics 
derived from the data. We demonstrate the accuracy of our method on 
existing datasets, and we show that it can discover previously missed re- 

^ gions and can more clearly discriminate between multiple binding events. 

' ^ . The software T-PIC (Tree shape Peak Identification for ChlP-Seq) is 

available at 
[http : //math . berkeley . edu/~ vhower/tpic . html] 
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1 Introduction 

> 

^T) With rapidly decreasing costs of sequencing, next-generation sequencing assays 

0^ are increasingly being used for molecular measurements [28j. These techniques 

C ' generate millions of short reads and massive data sets, making it computation- 

^— ^ ally challenging to properly analyze the data. One such assay, called ChlP-Seq 

VO (chromatin immunoprecipitation followed by sequencing), is used to determine 

^^ DNA binding sites of a protein (see [2l[23] for a review). In ChlP-Seq, protein is 

^^ first cross-linked to DNA and the fragments subsequently sheared. Following a 

• • size selection step that enriches for fragments of specified lengths, the fragments 

^j^ ends are sequenced, and the resulting reads are aligned to the genome. Reads 

S> pile up at bound regions, but due to the "noise" inherent in the assay, calling 

%^ "peaks" is not a straightforward task. 

While there are many current algorithms for analyzing ChlP-Seq data ^7[IT3| 
[I71[18l[ll|20l[22l[25l[27l[29], there is still room for improvement as most rely on 
ad- hoc heuristics including height thresholds and simplistic filters. We present a 
new approach for calling peaks that uses not only the height but also the shape 
of a putative peak. We wish to quantify "peakness" of the data while avoiding 
too many heuristics. Specifically, we use shape to differentiate between random 
and nonrandom fragment placement on the genome in a statistically significant 
way. We compare our predictions to those made by PeakSeq [27| and MACS [29] 
using two published data sets. 
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2 Methods 

2.1 Overview of the algorithm 

The input to our algorithm consists of the aligned reads for both the sample 
and input control. We create a 'coverage function' — a map / from the genomic 
coordinates to the nonnegative integers — by extending each of the aligned sam- 
ple reads to the average fragment length L. The 'height' /(to) at a nucleotide to 
is the number of extended reads that contain t^. This coverage function is the 
data that we analyze. Motivated by current work in topological data analysis 
(TDA) [9J , we determine protein binding sites based on the shape of the coverage 
function. Our measure of shape comes from associating a rooted tree with each 
putative peak; a rooted tree is a type of graph that has a vertex specified as 
the "root" and all edges directed away from the root. This tree captures the im- 
portant local features of the coverage function (see [10 for further details). We 
summarize shape using a numerical descriptor of the tree: a tree shape statistic 
Ml that differentiates random from nonrandom coverage. 

Our work uses the theory developed in [10] to model random fragment place- 
ment on a genome. This theory yields Galton- Watson trees with generation de- 
pendent offspring distributions (see |T2j[T4l[T6j for background) that describe the 
shape of the coverage function and depend on the rate of the reads being aligned. 
Using the rate # Q ^f^ ^ mappe ^^ analyze data on the entire genome would be 

^ length or genome ^ " 

inappropriate. In the absence of binding, some genomic regions receive a large 
number of fragments while others receive very few [24]. Hence, we first divide 
the genome into regions where we expect the background to be homogeneous 
across each region and calculate a rate for each region separately. 

Then, for each region i?, we obtain a collection of subtrees/possible peaks 
from the segments in the set 

S = {teR I fit) > hn}, 

for a height Hr (a segment is a subset of S consisting of contiguous nucleotides). 
We use segments with at least 10 base pairs and build trees from the coverage 
function along each segment. Next, we simulate 10, 000 random trees with root 
at height hn^ and we record our tree shape statistic A^ for each tree. We use the 
resulting empirical distribution to obtain an estimate of the function 

Gnim) := F{M{T) > m | T is a random tree from region R}. 

We then compute M.{T) for each tree T in R and assign the probability 
GR{Ai{T)) to T. Once we have assigned probabilities to all subtrees obtained 
from the data, we account for multiple hypothesis testing using a Benjamini- 
Hochberg correction [3 . The significant trees are called as peaks, and we merge 
two called peaks in bordering regions provided the gap between them is less than 
L. Figure [l] gives a pictorial sketch of our method. 
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Fig. 1. In our method, aligned reads are extended to the average fragment length 
(for single end sequencing), and a coverage function records the number of extended 
reads containing each base pair. Trees capturing the shape of the coverage function are 
constructed and a tree shape statistic measuring the size of a maximal matching M is 
computed. By comparison to a null model derived from the expected shape of random 
trees, significant peaks are identified. 
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2.2 The shape of random fragment placement 

As illustrated in Figure [2] and discussed in [TOl [TTj, there is an equivalence 
between lattice path excursions (starting and ending at height h) and rooted 
trees with root at height h. We use this correspondence and simulate trees coming 
from random fragment placement using lattice path excursions for a discrete-time 
Markov chain Z on the nonnegative integers. 




Fig. 2. An example of lattice path excursion (A) and its associated rooted tree (B) 
is given. The rooted tree is obtained by taking equivalence classes of vertices in (A), 
as explained in p^ [TI] . The vertices in (A) that are chosen representatives for the 
equivalence classes are depicted with blue stars. 



The Markov chain Z approximates the jump skeleton of a coverage function 
created from random fragment placement ^10^ and has transition probabilities 



P(i,j) 



'1, if z = and j = 1, 

p{i), if z > 1 and j = i + 1, 

1 — p(i), if z > 1 and j = z — 1, 

^0, otherwise, 



(1) 



where 



p{k) 



„kOw 



dw for k > 1 



and = pL is the expected height of the coverage function created with rate p 
and mean fragment length L. We integrate by parts and obtain the recursion 



pik) = 1 - 
p(i) 



ip(fc-i), 
1-4 + ^ 



fc>2, 



which allows one to solve for p{k). In our simulation, we first step to height 
h + 1 and then either return to height h with probability 1 — p{h + 1) or step to 
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height /i + 2 with probabihty p{h + 1). We continue visiting states of the Markov 
chain — ^jumping up or down at state i depending on the probabihty p{i) — until 
we return to state h. This lattice path excursion corresponds to a simulated tree 
with root at height h. 

2.3 A tree shape statistic Al to measure "peakness" 

In order to select an appropriate tree statistic, we consider the extremes. Figure 
[3] depicts the jump skeletons and corresponding rooted trees for, respectively, 
a perfect peak and perfect noise. We select a tree statistic A4 that attains its 
extremal values on the path P^ and the star graph Sn on n vertices. 




Fig. 3. Two extremal trees are represented — the path Pio (A) and star graph Sio (B) 
on 10 vertices (blue vertices and green edges) — together with the jump skeleta (black 
vertices and edges) that give rise to the trees. 



A matching of a tree T is a subset M of edges of T so that no two edges in 
M share a common vertex of T. A matching M is a maximal matching provided 
\M\ > |M'| for any other matching M' of T. We define A^(T) = |M| for M a 
maximal matching of T, and we have A^(T) < [|J = M{Pn) and M{T) > 1 = 
A4{Sn) for any tree T with n vertices. In our implementation, we calculate the 
tree shape statistic M. using the algorithm in [5]. 

2.4 Subdividing the genome into regions 

We subdivide the genome into regions based on the input control and perform 
our analysis on each region separately. Given the input, we calculate a local rate 
function 

# of input tags originating in [t — 500, t + 500] 

^^^' = 1000 ■ 

We then discretize ( into a step function as follows. For each chromosome, we 
begin with the interval / = [l,i^], where iC is a user specified integer, and find 
the average of ( over /. We extend /, adding nucleotides i^+1, i^+2, • • • , to until 
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C(to + 1) differs from the computed average ( by more than a fixed user specified 
value D. The next interval begins as [to + 1, to + i^], and it is extended until ( 
jumps away from its average by more than D. For the human genome, we use 
K = 10^ 000, but one could use a smaller K for shorter genomes. Additionally, 
we use D = 5. Once all the intervals are determined for all chromosomes, we 
round each average ( to the nearest integer and define (disconnected) regions Rj 
based on the intervals whose average ( rounds to j. We calculate the local rate 

# of tags in data originating in Rj 

Pj ~ 



Y^ length(/) 



for the data along Rj. 



Remark 1. In the absence of input control, one could account for some of the 
rate variability by dividing the genome into intervals of a fixed size and using a 
rate p that depends on the mappability of the interval, 

# of tags originating in interval 



# of uniquely mappable bases in interval 
The mappability of an interval can be computed using the methods in [26] . 

2.5 Choosing the height Hr for a region R, 

Care must be used when selecting Hr^ which depends on the rate calculated for 
region R. If Hr is too low, our simulation will yield trees that are too broad 
and impractical to deal with. Additionally, our called peaks will be very wide. If 
Hr is too high, our simulated trees will be essentially trivial and will not refiect 
any information on shape. For a fixed /i, define S{h) to be the expected length 
of an excursion above height h. This is the expected return time for state h in 
the Markov chain Z with transition probabilities P(i^j) from Equation (IT]). We 
calculate S{h) using the fact S{h) = :^;At, where tt is the stationary distribution 
given by the recurrence 

^^^^ \7i{i - l)P{i - 1, i) + 7r{i + l)P{i + 1, z), if i > h. ^ ' 

These equations have a unique solution up to a multiplicative constant, and that 
constant is determined by the condition that ^^7r{i) = 1 [HI §6.4]. If 6 is the 
expected height of the coverage function on i?, then we define 

Hr := max ( \0], min (h) 

V sih)<c^ _ 

where C is a parameter that keeps the simulated trees to a reasonable size. Due 
to the great variability in the return times for Z, we use C = 7 in our analysis. 
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2.6 Correcting for multiple hypotheses 

With a = 0.01 as the significance level, we use a Benjamini-Hochberg correction 
[SI H] for multiple hypothesis testing. Once we have assigned probabilities to the 
A^ subtrees found on the entire genome, we order the probabilities from least to 
greatest p(i) < p(2) ^ • • • ^ P{n)- Let J be the largest j such that p(j) < ^. 
Then, a tree T in a region i? is a called peak provided Gr{M.{{T)) < ^. 

3 Results 

3.1 Predicting STATl binding sites as compared to PeakSeq 

Our method predicts 90,704 binding sites for STATl on the human genome 
using the Gerstein data [26 . We additionally used PeakSeq's publicly available 
code [26 with the default parameters to predict binding sites for the same data. 
PeakSeq's method involves a two-pass approach [27]. Initially, PeakSeq finds 
122,348 potential binding sites (66.8% of these overlap our peaks). In the second 
pass, PeakSeq identifies the significant peaks based on a parameter Pf with 

< Pf < 1. The most conservative choice is Pf = in which case PeakSeq calls 
30,049 binding sites (93.8% overlap our peaks). On the other extreme, using Pf = 

1 yields the most liberal peak calling with 63,843 binding sites (86.2% overlap 
with our peaks). Of the 36,743 of our peaks that PeakSeq did not find with either 
normalization, 14.1% (resp. 6.9%) are within 1000 bp of one of PeakSeq's called 
peaks for Pf = 1 (resp. Pf = 0). Table fl] and Figure [4] compare our predicted 

Table 1. Comparison with PeakSeq [27] and Gerstein's STATl data 



Mean # of % Near 


% Overlapping % Overlapping % Containing 


Length Peaks Promoter^ 


Exon 


Gene 


TTCNNNGAA* 


First Pass 363.4 122,348 12.3 


14.2 


53.9 


17.2 


Pf = 1 550.4 63,843 17.3 


19.3 


53.0 


27.7 


Pf = 642.9 30,049 14.0 


16.5 


55.5 


42.2 


T-PIC 807.7 90,704 14.9 


18.7 


53.3 


31.1 


Random 807.7 90,704 2.5 


7.8 


43.8 


22.3 



^ 'Near promoter' is within 1,000 bp upstream and 200 bp downstream as in ^ 
^binding motif for STATl [6] 



binding sites with PeakSeq's in terms of the genomic location of peaks. We also 
compare our predictions to random sequences. These are created by taking a 
random start site in the same chromosome and the same length for each of 
our called peaks. Additionally, Table [l] shows the number of peaks containing 
STATl 's binding motif. Even though we predict more significant peaks than 
PeakSeq does, we outperform PeakSeq (for at least one choice of Pf) for the 
percentages in Table [l] shown in bold. This suggests that our additional peaks 
contain true binding sites overlooked by PeakSeq's method. 
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Fig. 4. The relationships between genomic locations of peaks called by T-PIC and by 
PeakSeq with normalization parameters P/ = and P/ = 1 are depicted with Venn 
Diagrams. We additionally show the Venn Diagram for random sequences having the 
same number and length from each chromosome as our called peaks but with a random 
start site. The circular regions include peaks overlapping genes (G), overlapping exons 
(E), and near promoter (P). 



Figure |5] gives two examples of how our called peaks differ from those pre- 
dicted by PeakSeq. One example (Figure IsK) contains one of our peaks that 
lies near a promoter. PeakSeq's initial pass through the data did not select this 
peak of height seven for further analysis. Indeed, PeakSeq's height cutoff for this 
million bp in chromosome 2 was eight. Figure [5^ depicts a significant peak for 
PeakSeq with normalization parameter Pf = 1. This peak overlaps an EST but 
no known genes, and the tree we found on this interval did not have a statistically 
significant shape. 
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Fig. 5. Two examples that illustrate differences between peaks called by T-PIC and 
those predicted by PeakSeq. The coverage function for both the data and the input 
control are plotted for each example. We call the peak in (A) which is near a pro- 
moter and not found by PeakSeq. PeakSeq's significant peak (B) overlaps an expressed 
sequence tag (EST) but no known gene and was not predicted by T-PIC. 
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3.2 Predicting binding sites for Drosophila Melanogaster as 
compared to MACS 

With our algorithm, we predict binding sites for six transcription factors (with a 
total of 8 antibodies) for Drosophila Melanogaster. We use published data from 
the Eisen lab [8 (available at the NCBI GEO database [1 , accession GSE20369) 
and compare our method to MACS [29 . Peaks are called with MACS using the 
parameters from Bradley et al.'s original study [8 . Tables |2] and [s] give summaries 
of the peaks called by MACS and T-PIC. 

Table 2. MACS predictions for Eisen's Drosophila melanogaster data. 

Protein Mean zfjz oi % Found % Near % Overlapping % Overlapping 

Length Peaks By Our Method Promoter^ Exon Gene 

75.4 

75.2 

60.8 

74.1 

72 

64.9 

66 

, 65^7 

'Near promoter' is within 1,000 bp upstream and 200 bp downstream as in ^ 



Table 3. T-PIC predictions for Eisen's Drosphila melanogaster data. 

Protein Mean # of # Not Found # w/i 1000 bp % Near % Overlapping % Overlapping 
Length Peaks by MACS of MACS peak Promoter^ Exon Gene 



bed 


1,639 2,048 99 


cad 


1,594 4,271 96.6 


gt 


1,172.9 2,788 90.7 


hbl 


1,832.5 4,715 97.5 


hb2 


1,449.7 4,415 95.4 


kni 


1,983.9 535 85.2 


krl 


1,529.7 6,391 99.5 


kr2 


1,473.1 6,160 98.6 



42.3 


59.2 


50.3 


55 


27.9 


31.4 


45.8 


51.4 


42.2 


46.8 


40 


44.7 


35.9 


42.1 


34.5 


41.1 



bed 


661.9 


16,262 12,591 


498 


22.2 


45.2 


64.8 


cad 


942 


8,231 3,019 


336 


36.7 


44.5 


66.1 


gt 


881.4 


4,774 1,831 


123 


21.5 


26.1 


57.7 


hbl 


981.3 


7,497 1,867 


229 


35.2 


38.0 


65.7 


hb2 


923.6 


6,528 1,675 


221 


34.0 


37.1 


65.2 


kni 


974.3 


2,103 1,270 


14 


22.6 


31.1 


54.4 


krl 


844.7 


12,328 4,223 


727 


27.7 


33.9 


61.3 


kr2 


883.2 


11,588 4,049 


687 


27.5 


33.6 


60.5 



^ 'Near promoter' is within 1,000 bp upstream and 200 bp downstream as in [^ 



For all proteins, we call shorter peaks than MACS. While the percentages 
of MACS peaks near a promoter and overlapping genes/exons are greater than 
those for T-PIC, we predict substantially more peaks and a reasonable percent- 
age of them are still located in gene regions. Moreover, for our called peaks 
that do not overlap peaks called by MACS, a fair number of them fall within 
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1000 bp of a MACS peak, as shown in Table |3] Figure |6] shows called peaks for 
three of the transcription factors in the even skipped (eve) and snail (sna) loci 
along with the coverage function for each transcription factor. The binding for 
these two well-characterized loci has been previously studied [21 . In many cases, 
our peaks subdivide those called by MACS, which illustrates that our method 
is more sensitive to the presence of multiple binding sites. In other cases, we 
narrow a peak called by MACS and give a more specific location to the DNA 
binding. In Figure [6J we additionally find a few new examples of binding sites 
that MACS missed. 



jlIJlI 
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Fig. 6. Peaks called by both MACS and T-PIC for thee transcription factors — 
hunchback antibody 1 (hbl), knirps (kni), and kruppel antibody 2 (kr2) — in the even 
skipped (A) and snail (B) loci are shown. The plotted coverage function for each tran- 
scription factor suggests multiple peaks, and our method better "sees" the multiple 
binding sites. 



4 Conclusion 



We have developed a novel approach to the analysis of ChlP-Seq data, that aims 
to discover bound regions of DNA by topological analysis of read coverage func- 
tions. Our method-T-PIC-can be used with or without an input control, is fast, 
and freely available, making it suitable for general use. The approach compares 
favorably to two popular peak callers: PeakSeq and MACS. We find the major- 
ity of their called peaks while detecting additional sites of binding, improving 
upon percentages (PeakSeq), and separating peaks into multiple binding sites 
(MACS). Although we have focused on ChlP-Seq in this paper, the approach we 
describe to call peaks could also be of use in the analysis of other sequence-based 
assays. 
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